Course: Applied Geo-data Science at the University of Bern (Institute of Geography)

Supervisor: Prof. Dr. Benjamin Stocker

Advisers: Dr. Koen Hufkens, Pepa Aran, Pascal Schneider

Further information: https://geco-bern.github.io/agds/

Do you have questions about the workflow? Contact the Author:

Author: Bigler Patrick (patrick.bigler1@students.unibe.ch)

Matriculation number: 20-100-178

Reference: Report Exercise 4 (Chapter 8)

1 Introduction

1.1 Objectives

Modeling is an important aspect of data science, but there are different types of models with different levels of complexity. Using a realistic problem, the following skills should be practiced:

  • Understand the basics of regression and classification models.

  • Fit linear and logistic regression models in R.

  • Choose and calculate relevant model performance metrics.

  • Evaluate and compare regression models.

  • Detect data outliers.

  • Select the best predictive variables.

1.2 Theory

In environmental and climate science, dozens of variables are measured. With a model, we want to predict a target variable; here, we use a linear model. Unfortunately, we do not know in advance how many variables, and which ones, we should use. We only know that we are looking for a formula of the form:

\[ \hat{y}=\hat{a}_0 +\sum_{n=1}^{N}\hat{a}_n \cdot x_n \]

2 Methods

To build the best linear regression model, we could of course simply try all predictor combinations by permutation. But that would be a waste of resources. We cannot avoid a trial-and-error approach entirely, but we would like to reduce the number of combinations. We implement an algorithm that works as follows:

  1. Let \(p\) be one of \(N\) candidate predictors. \(N\) may comprise all available variables or only a subset of them; our algorithm will decide this (at the start, we do not know).

  2. The algorithm fits a bivariate linear model for each candidate \(p\). The \(p\) with the highest RSQ is kept as a predictor and deleted from the candidate list, because each variable can be used only once.

  3. The algorithm adds each remaining candidate to the current model in turn and fits models with \(p+1\) predictors. Again, the candidate with the highest RSQ is kept; then the AIC of the resulting model is calculated.

  4. Iterate step 3 until the AIC of the model does not decrease anymore.

  5. We have found the (presumably) optimal model.

For the first task, we apply only the first two steps because we want only a bivariate model. For the second task, we iterate step 3; this way, we should also be able to find the best multivariate model.
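The steps above can be sketched as a minimal forward-selection loop. This is a simplified sketch, not the report's actual sourced functions; the function name, the data frame `df`, and the target column name are assumptions for illustration:

```r
# Minimal sketch of forward stepwise selection by AIC.
# Assumed inputs: a data frame `df` and the name of the target column.
stepwise.forward <- function(df, target){
  candidates <- setdiff(names(df), target)  # predictors not yet used
  selected   <- c()                         # predictors chosen so far
  best.aic   <- Inf
  repeat{
    # Step 2/3: fit one model per remaining candidate, keep the highest RSQ
    rsq <- sapply(candidates, function(p){
      f <- reformulate(c(selected, p), response = target)
      summary(lm(f, data = df))$r.squared
    })
    best.p  <- candidates[which.max(rsq)]
    new.fit <- lm(reformulate(c(selected, best.p), response = target), data = df)
    new.aic <- AIC(new.fit)
    if (new.aic >= best.aic) break          # step 4: AIC no longer decreases
    best.aic   <- new.aic
    selected   <- c(selected, best.p)
    candidates <- setdiff(candidates, best.p)
    if (length(candidates) == 0) break
  }
  selected
}
```

With simulated data, the strongest predictor is picked first and noise variables tend to be rejected once the AIC penalty outweighs their tiny RSQ gain.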

2.1 R: Evaluation of the data

The open-source program RStudio (version 2022.12.0+353) was used for all analyses. The data handling in this program is done using the R language. We utilize the R package “tidyverse” and other packages to process the data effectively.

3 Programming and data evaluation

3.1 Packages

The following code chunk loads all packages we need. The package “conflicted” is particularly important: a function in one package can have the same name as a function in another package but do something different. “conflicted” enables us to set explicit preferences in such cases.

# Load the file
source("../../R/general/packages.R")
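As a hypothetical illustration of how “conflicted” resolves such a name clash (the report's actual preferences are set inside the sourced packages.R, which is not reproduced here):

```r
# "conflicted" turns ambiguous calls into errors until a preference is set
library(conflicted)
library(dplyr)

# dplyr::filter() and stats::filter() share a name but behave differently;
# declare explicitly which one should win for the rest of the session
conflict_prefer("filter", "dplyr")
```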

3.2 Read the file

First, we get the data and save it in our repository. We must do this only once: if the file already exists, we can read it directly. That is why we implemented an if-else statement.

# We want to know if a certain file already exists
name.of.file <- "../../data/re_stepwise/FLX_CH_stepwise.csv"

# If no such file exists, create it!
if (!file.exists(name.of.file)){
  # Access to the data
  url <- "https://raw.githubusercontent.com/geco-bern/agds/main/data/df_for_stepwise_regression.csv"
  
  # Read the data directly from the URL
  database.S1 <- read.table(url, header = TRUE, sep = ",")

  # Write a CSV file into the repository
  write_csv(database.S1, name.of.file)
  
  # Read the file
  database <- read_csv(name.of.file)

  # If such a file already exists, just read it!
  }else{database <- read_csv(name.of.file)}

3.3 Data Overview

3.3.1 Missing values

To make a good decision about the predictors, we need as much information as possible. For that, we read our file as a data frame. After that, we want an overview of the data (which variables does the data set contain, and what are their main statistical parameters?).

# Check whether there are any missing values (we show only the first six rows)
head(apply(database, 2, function(x) is.na(x)))
##      siteid TIMESTAMP  TA_F SW_IN_F LW_IN_F VPD_F  PA_F   P_F  WS_F TA_F_MDS
## [1,]  FALSE     FALSE FALSE   FALSE   FALSE FALSE FALSE FALSE FALSE    FALSE
## [2,]  FALSE     FALSE FALSE   FALSE   FALSE FALSE FALSE FALSE FALSE    FALSE
## [3,]  FALSE     FALSE FALSE   FALSE   FALSE FALSE FALSE FALSE FALSE    FALSE
## [4,]  FALSE     FALSE FALSE   FALSE   FALSE FALSE FALSE FALSE FALSE    FALSE
## [5,]  FALSE     FALSE FALSE   FALSE   FALSE FALSE FALSE FALSE FALSE    FALSE
## [6,]  FALSE     FALSE FALSE   FALSE   FALSE FALSE FALSE FALSE FALSE    FALSE
##      SW_IN_F_MDS LW_IN_F_MDS VPD_F_MDS CO2_F_MDS PPFD_IN GPP_NT_VUT_REF USTAR
## [1,]       FALSE        TRUE     FALSE     FALSE    TRUE           TRUE FALSE
## [2,]       FALSE        TRUE     FALSE     FALSE    TRUE           TRUE FALSE
## [3,]       FALSE        TRUE     FALSE     FALSE    TRUE          FALSE FALSE
## [4,]       FALSE        TRUE     FALSE     FALSE    TRUE          FALSE FALSE
## [5,]       FALSE        TRUE     FALSE     FALSE    TRUE          FALSE FALSE
## [6,]       FALSE        TRUE     FALSE     FALSE    TRUE          FALSE FALSE
# Load the function
source("../../R/general/function.visualize.na.values.R")

# Function call to visualize the NA
visualize.na.values.without.groups(database)

# Get an overview and the main statistical parameters
summary(database)
##     siteid            TIMESTAMP               TA_F            SW_IN_F       
##  Length:23011       Min.   :1996-01-01   Min.   :-29.184   Min.   :  0.156  
##  Class :character   1st Qu.:2002-12-01   1st Qu.:  0.431   1st Qu.: 52.933  
##  Mode  :character   Median :2007-02-15   Median :  7.286   Median :120.679  
##                     Mean   :2006-10-25   Mean   :  6.950   Mean   :137.633  
##                     3rd Qu.:2011-01-23   3rd Qu.: 13.313   3rd Qu.:215.243  
##                     Max.   :2014-12-31   Max.   : 31.166   Max.   :398.192  
##                                                                             
##     LW_IN_F          VPD_F             PA_F             P_F         
##  Min.   :138.1   Min.   : 0.000   Min.   : 80.37   Min.   :  0.000  
##  1st Qu.:265.0   1st Qu.: 0.831   1st Qu.: 84.31   1st Qu.:  0.000  
##  Median :300.1   Median : 2.542   Median : 97.38   Median :  0.010  
##  Mean   :295.8   Mean   : 3.787   Mean   : 93.47   Mean   :  2.320  
##  3rd Qu.:328.9   3rd Qu.: 5.239   3rd Qu.: 98.77   3rd Qu.:  1.749  
##  Max.   :420.1   Max.   :33.290   Max.   :103.28   Max.   :186.400  
##                                                                     
##       WS_F          TA_F_MDS        SW_IN_F_MDS       LW_IN_F_MDS   
##  Min.   :0.106   Min.   :-29.184   Min.   :  0.156   Min.   :138.1  
##  1st Qu.:1.805   1st Qu.:  0.446   1st Qu.: 53.317   1st Qu.:269.5  
##  Median :2.387   Median :  7.356   Median :120.031   Median :303.3  
##  Mean   :2.670   Mean   :  6.987   Mean   :137.482   Mean   :299.4  
##  3rd Qu.:3.286   3rd Qu.: 13.301   3rd Qu.:215.580   3rd Qu.:332.9  
##  Max.   :9.928   Max.   : 31.166   Max.   :398.192   Max.   :420.1  
##                  NA's   :325       NA's   :341       NA's   :11057  
##    VPD_F_MDS        CO2_F_MDS        PPFD_IN        GPP_NT_VUT_REF  
##  Min.   : 0.000   Min.   :297.3   Min.   : -0.249   Min.   :-6.683  
##  1st Qu.: 0.852   1st Qu.:372.1   1st Qu.: 91.040   1st Qu.: 0.802  
##  Median : 2.612   Median :383.7   Median :231.956   Median : 2.845  
##  Mean   : 3.850   Mean   :385.1   Mean   :271.714   Mean   : 3.503  
##  3rd Qu.: 5.326   3rd Qu.:395.6   3rd Qu.:430.028   3rd Qu.: 5.600  
##  Max.   :33.290   Max.   :724.2   Max.   :995.106   Max.   :18.511  
##  NA's   :768      NA's   :290     NA's   :4574      NA's   :2426    
##      USTAR       
##  Min.   :0.0749  
##  1st Qu.:0.2850  
##  Median :0.3856  
##  Mean   :0.4255  
##  3rd Qu.:0.5244  
##  Max.   :1.5941  
##  NA's   :2468
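A more compact way to obtain the per-column NA counts shown above is `colSums()` over `is.na()`. A small self-contained sketch (the toy data frame is an assumption for illustration, not the FLUXNET data):

```r
# Toy data frame with some missing values
df <- data.frame(TA_F  = c(1.2, NA, 3.4, 5.6),
                 VPD_F = c(NA, NA, 0.8, 2.1))

# Number of missing values per column
colSums(is.na(df))  # TA_F: 1, VPD_F: 2
```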

3.4 Bivariate linear regression

We chose the best variable for our predictor. For that, we tried all variables and chose the one with the highest RSQ (see introduction).
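Each row of the resulting table corresponds to one bivariate fit. A sketch of how one such RSQ/AIC pair is obtained (with simulated toy data, not the report's data frame):

```r
# One bivariate fit, sketched with simulated data
set.seed(42)
df   <- data.frame(x = rnorm(100))
df$y <- 1 + 0.5 * df$x + rnorm(100)

fit <- lm(y ~ x, data = df)
summary(fit)$r.squared  # RSQ: share of explained variance
AIC(fit)                # AIC: used later to compare candidate models
```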

# Access the outsourced function for the bivariate model
source("../../R/re_stepwise/function.bivariate.model.R")

# Function call
tibble.bi.model <- model.fitter(database, 16)
knitr::kable(tibble.bi.model)
Target          Predictor       RR          AIC        Fit
GPP_NT_VUT_REF  siteid          0.0458181   46735.32   NO
GPP_NT_VUT_REF  TIMESTAMP       0.0047922   47597.89   NO
GPP_NT_VUT_REF  TA_F            0.3871016   37619.26   NO
GPP_NT_VUT_REF  SW_IN_F         0.4306405   36102.41   NO
GPP_NT_VUT_REF  LW_IN_F         0.1676259   43919.98   NO
GPP_NT_VUT_REF  VPD_F           0.1908666   43337.05   NO
GPP_NT_VUT_REF  PA_F            0.0002403   47691.83   NO
GPP_NT_VUT_REF  P_F             0.0020894   47653.72   NO
GPP_NT_VUT_REF  WS_F            0.0082500   47526.24   NO
GPP_NT_VUT_REF  TA_F_MDS        0.3871137   37559.45   NO
GPP_NT_VUT_REF  SW_IN_F_MDS     0.4300828   36065.75   NO
GPP_NT_VUT_REF  LW_IN_F_MDS     0.1209379   25130.95   NO
GPP_NT_VUT_REF  VPD_F_MDS       0.1811937   42567.59   NO
GPP_NT_VUT_REF  CO2_F_MDS       0.0295658   47078.98   NO
GPP_NT_VUT_REF  PPFD_IN         0.4520702   29689.54   BEST FIT
GPP_NT_VUT_REF  GPP_NT_VUT_REF  NA          NA         NA
GPP_NT_VUT_REF  USTAR           0.0000316   45048.57   NO

3.4.1 Visualization of the bivariate model

For bivariate linear models, meaningful scatterplots exist, and they give us a first impression. But we want to know more about the coefficients, the intercept, the distribution of the residuals, and the outliers. For that, we use summary() and plot the model (see discussion).

# Calculate the probably best bivariate model
best.bi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN, data = database)

# Load the function
source("../../R/re_stepwise/function.vis.bi.model.R")

# Plot the bivariate model
vis.bi.model(database)

# Overview of all important model parameters
# Model call; we are also interested in the p-values
summary(best.bi.lm)
## 
## Call:
## lm(formula = GPP_NT_VUT_REF ~ PPFD_IN, data = database)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0475 -1.3255 -0.3758  1.2780 12.3295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.72533    0.03057   23.73   <2e-16 ***
## PPFD_IN      0.01062    0.00009  117.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 16872 degrees of freedom
##   (6137 observations deleted due to missingness)
## Multiple R-squared:  0.4521, Adjusted R-squared:  0.452 
## F-statistic: 1.392e+04 on 1 and 16872 DF,  p-value: < 2.2e-16
# Plot the model 
plot(best.bi.lm)

3.4.2 Development of RSQ

What happens with the RSQ? Here we visualize the RSQ for each variable. We can easily see that PPFD_IN has the highest RSQ, which is why we chose it as our predictor. But what if we want a multivariate regression? One might think that we should simply choose the variable with the second-highest RSQ as well. But we will see that this is generally not the case.

# Load the function
source("../../R/re_stepwise/function.vis.and.dev.bi.R")

# Function call
Vis_of_the_develop.bi.model(tibble.bi.model)

3.5 Multivariate linear regression

Here we want to answer the question of which predictors we should choose to get the probably best linear regression model. For that, we have written a function; it works as described in the introduction. We make two function calls. For the first, we use all variables except the target. This is, however, not necessarily the best approach: we could argue that the time factor and the location are irrelevant and drop these variables. We can see that the difference is huge.

# Access the outsourced function of the multivariate model
source("../../R/re_stepwise/function.multivariate.model.R")

# Function call with all columns but the target (column number 16)
multi.model.1 <- multivariate.model(database, c(16))
## [1] "Best model fit has AIC = 21762.1756977506"
##  [1] "PPFD_IN"   "LW_IN_F"   "siteid"    "VPD_F"     "TA_F_MDS"  "SW_IN_F"  
##  [7] "WS_F"      "CO2_F_MDS" "P_F"       "PA_F"
# Function call without the variables siteid, TIMESTAMP and GPP_NT_VUT_REF
multi.model.2 <- multivariate.model(database, c(1,2,16))
## [1] "Best model fit has AIC = 24469.2854126171"
## [1] "PPFD_IN"   "LW_IN_F"   "VPD_F"     "TA_F_MDS"  "SW_IN_F"   "P_F"      
## [7] "WS_F"      "CO2_F_MDS" "PA_F"
# Overview of the models
knitr::kable(multi.model.1)
Predictors  RSQ        AIC
PPFD_IN     0.4520702  29689.54
LW_IN_F     0.5301754  27096.52
siteid      0.5981762  24464.35
VPD_F       0.6160337  23699.27
TA_F_MDS    0.6454925  22354.30
SW_IN_F     0.6513971  22072.88
WS_F        0.6564185  21830.06
CO2_F_MDS   0.6571345  21796.85
P_F         0.6577174  21770.14
PA_F        0.6579195  21762.18
TIMESTAMP   0.6579273  21763.80
knitr::kable(multi.model.2)
Predictors  RSQ        AIC
PPFD_IN     0.4520702  29689.54
LW_IN_F     0.5301754  27096.52
VPD_F       0.5746412  25420.80
TA_F_MDS    0.5881248  24879.25
SW_IN_F     0.5918328  24728.65
P_F         0.5945354  24618.55
WS_F        0.5963983  24542.84
CO2_F_MDS   0.5981106  24473.10
PA_F        0.5982491  24469.29
TA_F        0.5982635  24470.68

3.5.1 Development of RSQ and AIC

We visualize the development of the models. We can see that the model with more variables has a higher RSQ. That is no surprise, because the RSQ always increases when a variable is added. The AIC is more important: it is lower for the model with more variables (21762.2 vs. 24469.3), and the RSQ is higher (0.6579 vs. 0.5983). It follows that model 1 is better than model 2. We also see that in the coefficients.
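The claim that the RSQ can only grow while the AIC can get worse is easy to demonstrate with a pure noise predictor. A sketch with simulated data (not the report's data):

```r
# A pure noise predictor: RSQ cannot decrease, but AIC is penalized
set.seed(7)
n  <- 300
df <- data.frame(x = rnorm(n), noise = rnorm(n))
df$y <- 2 * df$x + rnorm(n)

m1 <- lm(y ~ x, data = df)
m2 <- lm(y ~ x + noise, data = df)

summary(m2)$r.squared >= summary(m1)$r.squared  # always TRUE
c(AIC(m1), AIC(m2))  # AIC adds a 2-per-parameter penalty, so m2 usually loses
```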

# Load the function
source("../../R/re_stepwise/function.vis.and.dev.multi.R")

# Function call
vis.and.dev.multi.model(multi.model.1)

vis.and.dev.multi.model(multi.model.2, c("[without siteid and TIMESTAMP]"))

3.5.2 Visualization of the multivariate model

There are no meaningful scatterplots for multidimensional models. But we want to know more about the coefficients, the intercept, the distribution of the residuals, and the outliers. We make some calls using summary(). We get the same RSQ values as before, which indicates that our algorithm works. We see that not all coefficients are significant.

# Calculate the probably best multivariate model
best.multi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS + 
                      SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F + siteid + TIMESTAMP, 
                    data = database)

second.multi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS + 
                      SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F, 
                    data = database)

# Overview of all important model parameters
# Model call
summary(best.multi.lm)
## 
## Call:
## lm(formula = GPP_NT_VUT_REF ~ PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS + 
##     SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F + siteid + TIMESTAMP, 
##     data = database)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3803 -1.1726 -0.0431  1.1877 11.4929 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.833e+00  1.609e+00  -3.004  0.00267 ** 
## PPFD_IN       3.098e-03  4.833e-04   6.411 1.49e-10 ***
## LW_IN_F       7.823e-03  6.821e-04  11.469  < 2e-16 ***
## VPD_F        -3.238e-01  7.285e-03 -44.455  < 2e-16 ***
## TA_F_MDS      1.659e-01  4.972e-03  33.361  < 2e-16 ***
## SW_IN_F       1.680e-02  9.998e-04  16.808  < 2e-16 ***
## P_F          -1.290e-02  2.495e-03  -5.172 2.35e-07 ***
## WS_F         -1.900e-01  1.355e-02 -14.020  < 2e-16 ***
## CO2_F_MDS    -3.826e-03  8.153e-04  -4.693 2.71e-06 ***
## PA_F          5.735e-02  1.827e-02   3.139  0.00170 ** 
## siteidCH-Lae  8.313e-01  1.889e-01   4.400 1.09e-05 ***
## siteidFI-Hyy -5.900e-02  2.974e-01  -0.198  0.84274    
## siteidFR-Pue -1.834e+00  2.782e-01  -6.590 4.52e-11 ***
## TIMESTAMP    -6.450e-06  1.047e-05  -0.616  0.53783    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.905 on 16860 degrees of freedom
##   (6137 observations deleted due to missingness)
## Multiple R-squared:  0.6579, Adjusted R-squared:  0.6577 
## F-statistic:  2494 on 13 and 16860 DF,  p-value: < 2.2e-16
summary(second.multi.lm)
## 
## Call:
## lm(formula = GPP_NT_VUT_REF ~ PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS + 
##     SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F, data = database)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7477 -1.2228 -0.1153  1.1085 12.3241 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.5891656  0.4463032  -3.561 0.000371 ***
## PPFD_IN      0.0067222  0.0005163  13.021  < 2e-16 ***
## LW_IN_F      0.0152432  0.0006854  22.240  < 2e-16 ***
## VPD_F       -0.3885851  0.0077764 -49.970  < 2e-16 ***
## TA_F_MDS     0.1035322  0.0049296  21.002  < 2e-16 ***
## SW_IN_F      0.0127810  0.0010750  11.889  < 2e-16 ***
## P_F         -0.0233915  0.0026829  -8.719  < 2e-16 ***
## WS_F        -0.1374643  0.0138279  -9.941  < 2e-16 ***
## CO2_F_MDS   -0.0064341  0.0007675  -8.383  < 2e-16 ***
## PA_F         0.0077160  0.0032004   2.411 0.015922 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.064 on 16864 degrees of freedom
##   (6137 observations deleted due to missingness)
## Multiple R-squared:  0.5982, Adjusted R-squared:  0.598 
## F-statistic:  2790 on 9 and 16864 DF,  p-value: < 2.2e-16
# Plot the model 
plot(best.multi.lm)

4 Discussion

4.1 General

It seems that the step-forward algorithm has been successfully implemented. However, it is important to choose the predictors very carefully because the final model can vary. The target, of course, must be removed before we can use the algorithm: if we use the target as a predictor, the RSQ will always be 1. Additionally, we must be aware of linear combinations among predictors; they could distort our results, so we must always know what each variable stands for. After that, we must decide whether a variable is a meaningful predictor. For example, a time factor could be very important if the experiment took place outdoors and climate factors mattered. But if the experimental setting was indoors, under controlled laboratory conditions, then date and time are irrelevant. It is not always clear what a “good choice” is.

For task 1, we could have written a function very similar to that for task 2. But we wrote a function that is a little more complicated than necessary, in order to demonstrate how the algorithm works. The algorithm calculates a linear regression model for each predictor. The table shows that the model with the highest RSQ is not necessarily the model with the lowest AIC.

We also wrote a function for task 2. It is a step-forward algorithm. First, the function calculates a bivariate linear regression model for each candidate and chooses the one with the highest RSQ. After that, a multivariate regression model is calculated, and again the candidate with the highest RSQ is chosen. In each round, we compare the AIC. If the AIC is higher than that of the previous model, we stop the calculation, and we have probably found our best-fit model. But here we have the same problem as described above: our result depends on our input. Therefore, we need to consider which variables we include in the calculation.

For demonstration, we made several function calls. We can easily see that the results differ. That means we must (1) choose wisely (perhaps with an educated guess), (2) try different calls to be able to estimate how big the difference is, and (3) document how and why we decided.

4.2 Discussion of plot(model) call

Inspired by https://www.statology.org and the book “Statistik” by L. Fahrmeir

4.2.1 For the bivariate model

Residuals vs. fitted plot

We can see that the red line almost follows the dashed line: it is near zero and approximately horizontal. That means linearity is not violated, and the assumption behind using a linear regression model was reasonable.

Q-Q-Plot

The Q-Q plot shows us that the residuals almost follow a normal distribution. The error is probably mainly statistical and not systematic. The plot also suggests that the choice of a linear model was right, because the points almost follow the 45° line (if they lay exactly on it, the values would be perfectly normally distributed).

Scale-Location Plot

The red line should be roughly horizontal across the plot. If so, then the assumption of homoscedasticity is probably right. The scattering should be random, and there should be no pattern at all. Both are approximately true. The spread of the residuals is roughly equal at all fitted values.

Leverage-Plot

There are no values beyond Cook’s distance. That means there are no influential outliers that distort our analysis.
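Influence can also be checked numerically instead of visually. A minimal sketch using `cooks.distance()` with the common 4/n rule of thumb (the toy data and the threshold choice are assumptions, not the report's method):

```r
# Cook's distance for every observation of a toy model
set.seed(3)
df   <- data.frame(x = rnorm(50))
df$y <- df$x + rnorm(50)
fit  <- lm(y ~ x, data = df)

cd <- cooks.distance(fit)
which(cd > 4 / nrow(df))  # indices flagged by the 4/n rule of thumb
```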

4.2.2 For the multivariate model

The plots are approximately the same as for the bivariate model and need no further discussion. Only the scale-location plot needs a few words: the red line is not horizontal, which means the homoscedasticity assumption could be violated. It is not dramatic and does not require an intervention, but we should keep it in mind.

4.3 Development of RSQ and AIC

We can see that the RSQ increases quickly over the first few steps. Then the improvement slows down, and at some point there is no further gain, so we can stop extending our model. The AIC shows similar behavior, but it decreases instead of increasing: first the decrease is fast, then it slows down. In the last step, the AIC increased a bit; that is when we stop our calculation.

The RSQ is based on the variance decomposition theorem, which states that the variance of the dependent variable can be written as the sum of the variance explained by the regression model and the variance of the residuals (unexplained variance). By definition, the RSQ increases with additional variables, but the growth flattens out. Strictly, we should discuss the adjusted RSQ instead of the RSQ; the RSQ is always a bit higher than the adjusted RSQ. But it does not matter which one we use in the algorithm, because the algorithm stops when the AIC increases. When we report a final model, however, we should use the adjusted RSQ. The AIC is a quality measure for models. It is based on the log-likelihood plus a correction term that accounts for the number of variables. At first, the AIC becomes smaller; once the influence of the correction term becomes too large, the AIC stops decreasing, and we have a basis for deciding on the model choice.
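The relation between RSQ and adjusted RSQ mentioned above can be written out directly. A sketch verifying the textbook formula against R's own value (toy data are an assumption):

```r
# Verify the adjusted-RSQ formula against R's own value
set.seed(11)
n  <- 100
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- df$x1 + 0.5 * df$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = df)
s   <- summary(fit)
k   <- 2  # number of predictors

# Adjusted RSQ corrects the RSQ for the number of predictors
adj <- 1 - (1 - s$r.squared) * (n - 1) / (n - k - 1)
all.equal(adj, s$adj.r.squared)  # TRUE
```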

It is also important to note that RSQ values do not simply add up but must be recalculated at every step. This means that for a multivariate model, we cannot simply combine the variables with the highest RSQ from the individual bivariate models, or the variables with the highest correlation coefficients. This is visible in the figures “visualization of the development of RR and AIC (bivariate)”. That is why we have to use the AIC for a good model choice, and that is also why our step-forward algorithm recalculates the RSQ in each round.